Supervised Learning Classification Project: AllLife Bank Personal Loan Campaign

Problem Statement

Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and identify which segment of customers to target more.

Data Dictionary

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)? (0: No, 1: Yes)

Importing necessary libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn import metrics, tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score)

%matplotlib inline

Loading the dataset

In [3]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

data = pd.read_csv('/content/drive/MyDrive/Loan_Modelling.csv')
df = data.copy()
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in this dataset.")
Mounted at /content/drive
There are 5000 rows and 14 columns in this dataset.

Data Overview

  • Observations
  • Sanity checks
In [4]:
df.columns
Out[4]:
Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
       'CD_Account', 'Online', 'CreditCard'],
      dtype='object')
In [5]:
df.head()
Out[5]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [6]:
df.dtypes
Out[6]:
ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIPCode                 int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal_Loan           int64
Securities_Account      int64
CD_Account              int64
Online                  int64
CreditCard              int64
dtype: object
In [7]:
df.isnull().sum()
Out[7]:
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
In [8]:
df.nunique()
Out[8]:
ID                    5000
Age                     45
Experience              47
Income                 162
ZIPCode                467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal_Loan            2
Securities_Account       2
CD_Account               2
Online                   2
CreditCard               2
dtype: int64
In [9]:
df.drop(['ID'], axis=1, inplace=True)
df.head()
Out[9]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [10]:
# Using pandas to get_dummies for one-hot encoding
cat_features = ['Family', 'Education']
model = pd.get_dummies(df, columns=cat_features)
model.head()
Out[10]:
Age Experience Income ZIPCode CCAvg Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard Family_1 Family_2 Family_3 Family_4 Education_1 Education_2 Education_3
0 25 1 49 91107 1.6 0 0 1 0 0 0 0 0 0 1 1 0 0
1 45 19 34 90089 1.5 0 0 1 0 0 0 0 0 1 0 1 0 0
2 39 15 11 94720 1.0 0 0 0 0 0 0 1 0 0 0 1 0 0
3 35 9 100 94112 2.7 0 0 0 0 0 0 1 0 0 0 0 1 0
4 35 8 45 91330 1.0 0 0 0 0 0 1 0 0 0 1 0 1 0
In [11]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 5000 non-null   int64  
 1   Experience          5000 non-null   int64  
 2   Income              5000 non-null   int64  
 3   ZIPCode             5000 non-null   int64  
 4   Family              5000 non-null   int64  
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   int64  
 7   Mortgage            5000 non-null   int64  
 8   Personal_Loan       5000 non-null   int64  
 9   Securities_Account  5000 non-null   int64  
 10  CD_Account          5000 non-null   int64  
 11  Online              5000 non-null   int64  
 12  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(12)
memory usage: 507.9 KB

Exploratory Data Analysis.

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?
In [12]:
# Histogram for Mortgage distribution
plt.figure(figsize=(5, 5))
sns.histplot(df['Mortgage'], kde=True, bins=50)
plt.title('Distribution of Mortgage')
plt.xlabel('Mortgage Value')
plt.ylabel('Frequency')
plt.show()

# Boxplot to check for outliers in Mortgage
plt.figure(figsize=(5, 5))
sns.boxplot(x=df['Mortgage'])
plt.title('Boxplot of Mortgage')
plt.xlabel('Mortgage Value')
plt.show()

Distribution of Mortgage :

The histogram indicates that there is a right-skewed distribution for the 'Mortgage' attribute. Most customers have low or no mortgage values. There appears to be a much smaller number with higher mortgages. In addition, there are a significant number of outliers indicating that the there are individuals with significantly higher mortgages.

In [13]:
df['CreditCard'].value_counts()
Out[13]:
0    3530
1    1470
Name: CreditCard, dtype: int64

Number of Customers with Credit Cards:

There are 1,470 customers who have one or more credit card(s)

In [14]:
# Correlation with Personal Loan
correlation=df.corr()['Personal_Loan'].sort_values(ascending=False)

Attributes Strongly Correlated with Personal Loan:

Attributes closely linked to the acceptance of a personal loan include 'Income', 'CCAvg' (average spending on credit cards), 'CD_Account' (ownership of a certificate of deposit account), 'Mortgage', and 'Education'. These factors are significant in predicting the probability of a customer opting for a personal loan.

In [15]:
# Count plot for Loan Interest vs Age
plt.figure(figsize=(15, 6))
sns.countplot(x='Age', hue='Personal_Loan', data=df)
plt.title('Loan Interest vs Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()

Loan Interest vs Age:

Although the trend is not substantial, the count plot for 'Age' shows that customers of many different ages are interested in personal loans.

In [16]:
# Count plot for Loan Interest vs Education
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Education', hue='Personal_Loan')
plt.title('Loan Interest vs Education')
plt.xlabel('Education')
plt.ylabel('Count');

Loan Interest v. Education:

The count plot of 'Education' suggests that customer interest in personal loans varies according to their educational backgrounds. It seems that individuals with advanced educational qualifications may demonstrate a marginally greater inclination towards personal loans.

In [17]:
df.describe().T
Out[17]:
count mean std min 25% 50% 75% max
Age 5000.0 45.338400 11.463166 23.0 35.0 45.0 55.0 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.0 20.0 30.0 43.0
Income 5000.0 73.774200 46.033729 8.0 39.0 64.0 98.0 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.0 93437.0 94608.0 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.0 2.0 3.0 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.7 1.5 2.5 10.0
Education 5000.0 1.881000 0.839869 1.0 1.0 2.0 3.0 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.0 0.0 101.0 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.0 0.0 0.0 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.0 0.0 0.0 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.0 0.0 0.0 1.0
Online 5000.0 0.596800 0.490589 0.0 0.0 1.0 1.0 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.0 0.0 1.0 1.0
  • Note that it appears that experience is negative, and that is not possible. We will take the absolute value.
  • There is 5000 data points in each attribute. The mean age is approx. 45.33 We will continue to look a little closer at individual attributes and how they influence one another.
In [18]:
# Converted negative values in the 'Experience' column to absolute values
df['Experience'] = df['Experience'].abs()
In [19]:
#creating smaller dataframe
columns = df[['Experience', 'Age', 'Income', 'CCAvg', 'Mortgage']]

# Creating the subplot grid
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10, 10))
axes = axes.flatten()
for i, col in enumerate(columns):
    sns.histplot(df[col], bins=10, ax=axes[i], alpha=0.5, edgecolor='black', color='skyblue', kde=True)

    # Calculating mean and median
    mean_value = df[col].mean()
    median_value = df[col].median()

    # Adding lines for mean and median
    axes[i].axvline(mean_value, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_value:.2f}')
    axes[i].axvline(median_value, color='blue', linestyle='-', linewidth=2, label=f'Median: {median_value:.2f}')

    # Adding a legend
    axes[i].legend()

plt.tight_layout()  # Adjusts the plots to fit into the figure area.
plt.show()

Experience and age appear is be very similar. Income is right skewed as expected with the median sitting around 64k/yr. CC-avg is also right skewed, which makes sense from the perspective that most people only have 1-3 credit cards.

In [20]:
#See how many 0's are present in the mortgage
print("Number of customers with Mortgages at $0 is", df[df.Mortgage==0].shape[0])

print("Or, approximately", (df[df.Mortgage==0].shape[0]/df.Mortgage.shape[0])*100,"%")
Number of customers with Mortgages at $0 is 3462
Or, approximately 69.24 %

A very large amount of people (or 69.24%) do not have mortgages!! Interesting!

In [21]:
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, fmt=".1f")
plt.show()

Age and experience are seemingly congruent, we could drop one if necessary... Personal loan seems like a high correlation with Personal_Loan, and relatively high with CC-avg. There is some notable findings from CD_Account, but hard to understand if this is promising or just a coincidence.

Additionally, income and CC-avg are correlated, which is interesting.

In [22]:
top_zip_codes = df['ZIPCode'].value_counts().nlargest(20)
sns.barplot(x=top_zip_codes.index, y=top_zip_codes.values)
plt.title('Top 20 ZIP Codes by Number of Customers')
plt.xlabel('ZIP Code')
plt.xticks(rotation=45)
plt.ylabel('Number of Customers')
plt.show()
In [23]:
!pip install uszipcode
!pip install zipcodes

from uszipcode import SearchEngine
search = SearchEngine()

import zipcodes
Collecting uszipcode
  Downloading uszipcode-1.0.1-py2.py3-none-any.whl (35 kB)
Requirement already satisfied: attrs in /usr/local/lib/python3.10/dist-packages (from uszipcode) (23.2.0)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from uszipcode) (2.31.0)
Collecting pathlib-mate (from uszipcode)
  Downloading pathlib_mate-1.3.2-py3-none-any.whl (56 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.0/57.0 kB 3.2 MB/s eta 0:00:00
Collecting atomicwrites (from uszipcode)
  Downloading atomicwrites-1.4.1.tar.gz (14 kB)
  Preparing metadata (setup.py) ... done
Collecting fuzzywuzzy (from uszipcode)
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Collecting haversine>=2.5.0 (from uszipcode)
  Downloading haversine-2.8.1-py2.py3-none-any.whl (7.7 kB)
Requirement already satisfied: SQLAlchemy>=1.4.0 in /usr/local/lib/python3.10/dist-packages (from uszipcode) (2.0.24)
Collecting sqlalchemy-mate>=1.4.28.3 (from uszipcode)
  Downloading sqlalchemy_mate-1.4.28.4-py2.py3-none-any.whl (77 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.1/77.1 kB 8.3 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions>=4.2.0 in /usr/local/lib/python3.10/dist-packages (from SQLAlchemy>=1.4.0->uszipcode) (4.5.0)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.10/dist-packages (from SQLAlchemy>=1.4.0->uszipcode) (3.0.3)
Collecting SQLAlchemy>=1.4.0 (from uszipcode)
  Downloading SQLAlchemy-1.4.51-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 16.2 MB/s eta 0:00:00
Requirement already satisfied: prettytable in /usr/local/lib/python3.10/dist-packages (from sqlalchemy-mate>=1.4.28.3->uszipcode) (3.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->uszipcode) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->uszipcode) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->uszipcode) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->uszipcode) (2023.11.17)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.10/dist-packages (from prettytable->sqlalchemy-mate>=1.4.28.3->uszipcode) (0.2.13)
Building wheels for collected packages: atomicwrites
  Building wheel for atomicwrites (setup.py) ... done
  Created wheel for atomicwrites: filename=atomicwrites-1.4.1-py2.py3-none-any.whl size=6942 sha256=0e18f1667ae206bd01f4cf7b86b5d56d930da5b0e642d6e53555ba3a24573f4d
  Stored in directory: /root/.cache/pip/wheels/34/07/0b/33b15f68736109f72ea0bb2499521d87312b932620737447a2
Successfully built atomicwrites
Installing collected packages: fuzzywuzzy, SQLAlchemy, pathlib-mate, haversine, atomicwrites, sqlalchemy-mate, uszipcode
  Attempting uninstall: SQLAlchemy
    Found existing installation: SQLAlchemy 2.0.24
    Uninstalling SQLAlchemy-2.0.24:
      Successfully uninstalled SQLAlchemy-2.0.24
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython-sql 0.5.0 requires sqlalchemy>=2.0, but you have sqlalchemy 1.4.51 which is incompatible.
Successfully installed SQLAlchemy-1.4.51 atomicwrites-1.4.1 fuzzywuzzy-0.18.0 haversine-2.8.1 pathlib-mate-1.3.2 sqlalchemy-mate-1.4.28.4 uszipcode-1.0.1
Collecting zipcodes
  Downloading zipcodes-1.2.0-py2.py3-none-any.whl (719 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 719.6/719.6 kB 9.9 MB/s eta 0:00:00
Installing collected packages: zipcodes
Successfully installed zipcodes-1.2.0
/usr/local/lib/python3.10/dist-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Download /root/.uszipcode/simple_db.sqlite from https://github.com/MacHu-GWU/uszipcode-project/releases/download/1.0.1.db/simple_db.sqlite ...
  1.00 MB downloaded ...
  2.00 MB downloaded ...
  3.00 MB downloaded ...
  4.00 MB downloaded ...
  5.00 MB downloaded ...
  6.00 MB downloaded ...
  7.00 MB downloaded ...
  8.00 MB downloaded ...
  9.00 MB downloaded ...
  10.00 MB downloaded ...
  11.00 MB downloaded ...
  Complete!
In [24]:
# Function to get state from a zipcode
def get_state_from_zip(zipcode):
    # Retrieve the zipcode object
    zipcode_info_1 = search.by_zipcode(zipcode)
    # Return the state attribute
    return zipcode_info_1.state if zipcode_info_1 else 'NaN'

def get_city_from_zip(zipcode):
    # Retrieve the zipcode object
    zipcode_info_2 = search.by_zipcode(zipcode)

     # Return the state attribute
    return zipcode_info_2.city if zipcode_info_2 else 'NaN'

# Apply the function to each row in the 'ZIPCode' column
df['State'] = df['ZIPCode'].apply(get_state_from_zip)
df['City'] = df['ZIPCode'].apply(get_city_from_zip)


print(df['State'].unique())

print(df['City'].nunique())
['CA' 'NaN']
245

Understanding more detail around Zip codes is important. Findings indicate, that the data is focused in California. There are approximately 245 cities in the data set.

In [25]:
# Identify the top 10 cities by frequency
top_cities = df['City'].value_counts().head(10)

# Plotting
top_cities.plot(kind='bar')
plt.title('Top 10 Cities by Frequency')
plt.xlabel('City')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

LA has a high count followed by San Diego, San Francisco, and Berkley. This makes sense because they are both affluent areas and have large populations.

In [26]:
df_cat=df[['CreditCard','Personal_Loan','Family','Education','Securities_Account','CD_Account','Online']]
for columns in df_cat.columns:
    print(f"Counts for {columns}:")
    print(df_cat[columns].value_counts())
Counts for CreditCard:
0    3530
1    1470
Name: CreditCard, dtype: int64
Counts for Personal_Loan:
0    4520
1     480
Name: Personal_Loan, dtype: int64
Counts for Family:
1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64
Counts for Education:
1    2096
3    1501
2    1403
Name: Education, dtype: int64
Counts for Securities_Account:
0    4478
1     522
Name: Securities_Account, dtype: int64
Counts for CD_Account:
0    4698
1     302
Name: CD_Account, dtype: int64
Counts for Online:
1    2984
0    2016
Name: Online, dtype: int64

Credit Card Ownership: A total of 5,000 individuals are represented, with 3,530 not owning a credit card (70.6%) and 1,470 owning a credit card (29.4%). Interest in Personal Loans: A small fraction, 480 out of 5,000 (9.6%), have shown interest in personal loans.

Family Size: The distribution of family sizes is relatively uniform, with the smallest family size (1 member) being the most common at 1,472 individuals (29.44%), followed by two-member families (1,296 or 25.92%), four-member families (1,222 or 24.44%), and three-member families being the least common (1,010 or 20.2%).

Education Level: Education level 1 (presumably the lowest level of education) is the most common, with 2,096 individuals (41.92%), followed by level 3 with 1,501 individuals (30.02%), and level 2 with 1,403 individuals (28.06%).

Securities Account Ownership: The majority, 4,478 (89.56%), do not have a securities account, while 522 (10.44%) do, indicating a low prevalence of securities account ownership.

CD Account Ownership: Similar to securities accounts, a small portion, 302 out of 5,000 (6.04%), own a CD account.

Online Banking Usage: A majority, 2,984 (59.68%), use online banking services, suggesting a preference or inclination towards the convenience of online banking among more than half of the individuals.

Overall, the data highlights several key trends: a general reluctance or lack of necessity for personal loans and securities accounts, a slight preference towards not having a credit car and a majority preference for using online banking services. These factors may be important later on when evaluating the big picture for target audience.

In [27]:
df_box=df[['Experience','Age','Income','CCAvg','Mortgage']]
plt.figure(figsize=(10,6))

# Creating a figure and a set of subplots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10,10))

# Looping through the columns and creating box plots
for i, columns in enumerate(df_box):
    row_index = i // 2  #  division to get row index
    column_index = i % 2  # Remainder to get column index
    sns.boxplot(y=df[columns], ax=axes[row_index, column_index],color='blue',capprops=dict(color='blue'), showmeans=True)
    axes[row_index, column_index].set_title(f'{columns} Distribution');
<Figure size 1000x600 with 0 Axes>

Another view of the distributions of some of variables and the prominent skewing.

In [28]:
# Selecting a subset of variables for clarity
selected_columns = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage', 'Personal_Loan']
sns.pairplot(df[selected_columns], hue='Personal_Loan')
plt.show()

Looking at the correlations in a different way. We can see a spread with scatterplot. A very broad stroke appraoch to understanding of the distributions, for instance. Age adn experience, again, appear to be very well correlated. Income does appear to be a determining factor, or displays a sort of splitting point for those with mortgages and those without. Something to consider for later.

In [29]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Income', y='Mortgage',hue='Personal_Loan', data=df)
plt.title('Mortgage vs. Income')
plt.xlabel('Income')
plt.ylabel('Mortgage')
plt.show()

A closer look at income and mortgage, comments made earlier seem to be accurate with income being a bit of a deciding factor with mortgages.

In [30]:
sns.boxplot(x='Education', y='Income', data=df)
plt.title('Income Distribution by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Income')
plt.show()

Interestingly, income seems to be higher and wider spread in lower educated individuals. However, this may be influenced by the sheer amount of data and more equal distribution. It appears that there are a fair amount of outliers in the higher education groups.

In [31]:
sns.lineplot(x='Age', y='Experience', data=df)
plt.title('Age vs Experience')
plt.xlabel('Age')
plt.ylabel('Experience')
plt.show()
In [32]:
# Preparing the data
Edu_CC = df.groupby('Education')['CreditCard'].value_counts(normalize=True).unstack()

# Plotting
Edu_CC.plot(kind='bar', stacked=True)
plt.title('Credit Card Ownership by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Proportion')
plt.show()

Education and Credit cards do not appear to be a good indicator together.

In [33]:
df.groupby('Personal_Loan')['Income'].describe()
Out[33]:
count mean std min 25% 50% 75% max
Personal_Loan
0 4520.0 66.237389 40.578534 8.0 35.0 59.0 84.0 224.0
1 480.0 144.745833 31.584429 60.0 122.0 142.5 172.0 203.0

Interestingly much higher incomes for those with mortagages (both median and mode)

In [34]:
df_Loan_1 = df[df['Personal_Loan'] == 1]
df_Loan_0 = df[df['Personal_Loan'] == 0]
# Plotting scatter plots for Income of customers with and without personal loans
plt.figure(figsize=(12, 6))

# Scatter plot for customers with personal loans
sns.scatterplot(data=df_Loan_1, x=df_Loan_1.index, y='Income', color='blue', label='With Personal Loan')

# Scatter plot for customers without personal loans
sns.scatterplot(data=df_Loan_0, x=df_Loan_0.index, y='Income', color='red', label='Without Personal Loan')

plt.title('Income of Customers With and Without Personal Loans')
plt.xlabel('Customer Index')
plt.ylabel('Income')
plt.legend()
plt.show()

Again, a similar conclusion can be made, with high income being an indicator of personal loans.

In [35]:
# Creating values for the average (and median) income in each group
print('Mean income of customers with a loan', df_Loan_1['Income'].mean())
print('Median income of customers with a loan',df_Loan_1['Income'].median())
print('Mean income of customers without a loan',df_Loan_0['Income'].mean())
print('Median income of customers without a loan',df_Loan_0['Income'].median())
Mean income of customers with a loan 144.74583333333334
Median income of customers with a loan 142.5
Mean income of customers without a loan 66.23738938053097
Median income of customers without a loan 59.0
In [36]:
# Creating a histogram on the same chart to show the distribution of Income
plt.figure(figsize=(12, 6))

# Histogram for customers with personal loans
sns.histplot(df_Loan_1['Income'], color='blue', label='With Personal Loan', alpha=0.6, bins=30, kde=True)

# Histogram for customers without personal loans
sns.histplot(df_Loan_0['Income'], color='red', label='Without Personal Loan', alpha=0.6, bins=30, kde=True)

plt.title('Income Distribution of Customers With and Without Personal Loans')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Another angle and distribution, attempting to understnad how the skew plays in. Note, that the personal loan population appears to have a more standard distribution and less skew.

In [37]:
# Histogram for customers with personal loans, normalized
sns.histplot(df_Loan_1['Income'], color='blue', label='With Personal Loan', alpha=0.6, bins=15, kde=True, stat="density")

# Histogram for customers without personal loans, normalized
sns.histplot(df_Loan_0['Income'], color='red', label='Without Personal Loan', alpha=0.6, bins=15, kde=True, stat="density")

plt.title('Normalized Income Distribution of Customers With and Without Personal Loans')
plt.xlabel('Income')
plt.ylabel('Density')
plt.legend()
plt.show()

By creating a density plot, we can see the appearance of a 'breaking' point between the two... This may be interesting when looking at incorporating new customers.

In [38]:
# Calculate various percentiles of income for both groups
percentiles = [50,55, 60, 65, 70, 75, 80, 85, 90, 95]  # You can adjust these percentiles as needed
income_percentiles_loan = df_Loan_1['Income'].quantile(q=[p/100 for p in percentiles])
income_percentiles_no_loan = df_Loan_0['Income'].quantile(q=[p/100 for p in percentiles])

#displaying table:
print("Income Percentiles for Customers with Personal Loan:")
print(income_percentiles_loan)
print("\nIncome Percentiles for Customers without Personal Loan:")
print(income_percentiles_no_loan)

df_percentiles = pd.DataFrame({
    'Percentile': percentiles,
    'Income_with_Loan': income_percentiles_loan,
    'Income_without_Loan': income_percentiles_no_loan
})


# Plotting the data
plt.figure(figsize=(12, 6))
plt.plot('Percentile', 'Income_with_Loan', data=df_percentiles, marker='o', color='blue', label='With Personal Loan')
plt.plot('Percentile', 'Income_without_Loan', data=df_percentiles, marker='o', color='red', label='Without Personal Loan')
plt.title('Income Percentiles for Customers With and Without Personal Loans')
plt.xlabel('Percentile')
plt.ylabel('Income')
plt.xticks(percentiles)
plt.legend()
plt.show()
Income Percentiles for Customers with Personal Loan:
0.50    142.5
0.55    152.0
0.60    158.0
0.65    162.0
0.70    169.0
0.75    172.0
0.80    178.0
0.85    182.0
0.90    184.1
0.95    191.0
Name: Income, dtype: float64

Income Percentiles for Customers without Personal Loan:
0.50     59.0
0.55     63.0
0.60     70.0
0.65     75.0
0.70     81.0
0.75     84.0
0.80     93.0
0.85    109.0
0.90    124.0
0.95    150.0
Name: Income, dtype: float64

The trend of income vs. percentile is a good visual of where the general cut off or opportunity lies with those with high enough income to take on a loan.

In [39]:
# Breaking down the average CC-Avg in each group with and without a loan
print('Mean CC-Avg of customers with a loan', df_Loan_1['CCAvg'].mean())
print('Median CC-Avg of customers with a loan',df_Loan_1['CCAvg'].median())
print('Mean CC-Avg of customers without a loan',df_Loan_0['CCAvg'].mean())
print('Median CC-Avg of customers without a loan',df_Loan_0['CCAvg'].median())
Mean CC-Avg of customers with a loan 3.905354166666667
Median CC-Avg of customers with a loan 3.8
Mean CC-Avg of customers without a loan 1.7290088495575222
Median CC-Avg of customers without a loan 1.4

Customers with loans tend to have approximately two more credit cards! Which seems to indicate these individuals are willing to take on more debt

In [40]:
ccavg_percentiles_loan = df_Loan_1['CCAvg'].quantile(q=[p/100 for p in percentiles])
ccavg_percentiles_no_loan = df_Loan_0['CCAvg'].quantile(q=[p/100 for p in percentiles])

#updating the DataFrame for the CCAvg percentiles
df_percentiles['CCAvg_with_Loan'] = ccavg_percentiles_loan
df_percentiles['CCAvg_without_Loan'] = ccavg_percentiles_no_loan

#displaying table:
print("CC-Avg Percentiles for Customers with Personal Loan:")
print(ccavg_percentiles_loan)
print("\nCC-Avg Percentiles for Customers without Personal Loan:")
print(ccavg_percentiles_no_loan)
CC-Avg Percentiles for Customers with Personal Loan:
0.50    3.8000
0.55    4.1000
0.60    4.3000
0.65    4.6805
0.70    5.0000
0.75    5.3475
0.80    5.7000
0.85    6.1000
0.90    6.6000
0.95    7.4050
Name: CCAvg, dtype: float64

CC-Avg Percentiles for Customers without Personal Loan:
0.50    1.40
0.55    1.60
0.60    1.75
0.65    1.90
0.70    2.10
0.75    2.30
0.80    2.50
0.85    2.80
0.90    3.40
0.95    4.90
Name: CCAvg, dtype: float64

A more numerical representation to see the correlation that is evident within the split groups of those with loans and those without.

In [41]:
# Creating histograms for income distribution
plt.figure(figsize=(12, 6))
sns.histplot(df_Loan_1['CCAvg'], color='blue', label='With Personal Loan', alpha=0.6, bins=30, kde=True, stat="density")
sns.histplot(df_Loan_0['CCAvg'], color='red', label='Without Personal Loan', alpha=0.6, bins=30, kde=True, stat="density")
plt.title('Normalized CCAvg Distribution for Customers With and Without Personal Loans')
plt.xlabel('Income')
plt.ylabel('Density')
plt.legend()
plt.show()

# Plotting the data
plt.figure(figsize=(12, 6))
plt.plot('Percentile', 'CCAvg_with_Loan', data=df_percentiles, marker='o', color='blue', label='With Personal Loan')
plt.plot('Percentile', 'CCAvg_without_Loan', data=df_percentiles, marker='o', color='red', label='Without Personal Loan')
plt.title('CCAvg Percentiles for Customers With and Without Personal Loans')
plt.xlabel('Percentile')
plt.ylabel('CCAvg')
plt.xticks(percentiles)
plt.legend()
plt.show()

Further visualtions to help us understand how the data is distributed. Certainly some difference compared to income. Perhaps more inclusive of the 'without loan' group, and maybe a good indicator of reliablilty with further information such as credit score!

In [42]:
df_percentiles.describe().T
Out[42]:
count mean std min 25% 50% 75% max
Percentile 10.0 72.5000 15.138252 50.0 61.250000 72.50000 83.750 95.000
Income_with_Loan 10.0 169.0600 15.392004 142.5 159.000000 170.50000 181.000 191.000
Income_without_Loan 10.0 90.8000 28.966647 59.0 71.250000 82.50000 105.000 150.000
CCAvg_with_Loan 10.0 5.3033 1.158876 3.8 4.395125 5.17375 6.000 7.405
CCAvg_without_Loan 10.0 2.4650 1.042979 1.4 1.787500 2.20000 2.725 4.900
In [43]:
# Making a copy of the DataFrame
df_Age = df.copy()

# Creating age groups and adding it as a new column
df_Age['Age_Group'] = pd.cut(df_Age['Age'], bins=[18, 24, 30, 36, 42, 50, 56, 62, 68], labels=['18-23', '24-29', '30-35', '36-41', '42-49', '50-55', '56-61', '62-68'])

# Grouping by Age_Group and counting the CreditCard usage
age_group_counts = df_Age.groupby('Age_Group')['CreditCard'].count()

# Print the counts
print(age_group_counts)

# Creating the bar plot with Seaborn
import seaborn as sns
sns.barplot(x=age_group_counts.index, y=age_group_counts.values)
plt.title('Credit Card Count by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Credit Card Count')
Age_Group
18-23      40
24-29     584
30-35     757
36-41     741
42-49    1008
50-55     789
56-61     779
62-68     302
Name: CreditCard, dtype: int64
Out[43]:
Text(0, 0.5, 'Credit Card Count')

A normal distribution is evident with CC count and age group!

In [44]:
# Calculating correlation matrix for the DataFrame
corr_matrix = df.corr()

# Sorting the 'Personal_Loan' column correlations
sorted_corr = corr_matrix['Personal_Loan'].sort_values(ascending=False)
print(sorted_corr)
Personal_Loan         1.000000
Income                0.502462
CCAvg                 0.366889
CD_Account            0.316355
Mortgage              0.142095
Education             0.136722
Family                0.061367
Securities_Account    0.021954
Online                0.006278
CreditCard            0.002802
ZIPCode              -0.002974
Age                  -0.007726
Experience           -0.008304
Name: Personal_Loan, dtype: float64
<ipython-input-44-0b1e1de8daa2>:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  corr_matrix = df.corr()

A good visual of the correlations between the data attributes and personal loan in order of influence.

Data Preprocessing

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
In [45]:
#removing the add-in, state and city from the orginal data.
df.drop(['State','City'],axis=1,inplace=True)
df.describe()
Out[45]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000
mean 45.338400 20.134600 73.774200 93169.257000 2.396400 1.937938 1.881000 56.498800 0.096000 0.104400 0.06040 0.596800 0.294000
std 11.463166 11.415189 46.033729 1759.455086 1.147663 1.747659 0.839869 101.713802 0.294621 0.305809 0.23825 0.490589 0.455637
min 23.000000 0.000000 8.000000 90005.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 35.000000 10.000000 39.000000 91911.000000 1.000000 0.700000 1.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
50% 45.000000 20.000000 64.000000 93437.000000 2.000000 1.500000 2.000000 0.000000 0.000000 0.000000 0.00000 1.000000 0.000000
75% 55.000000 30.000000 98.000000 94608.000000 3.000000 2.500000 3.000000 101.000000 0.000000 0.000000 0.00000 1.000000 1.000000
max 67.000000 43.000000 224.000000 96651.000000 4.000000 10.000000 3.000000 635.000000 1.000000 1.000000 1.00000 1.000000 1.000000
In [46]:
# breaking apart education and family into 1's and 0's using dummy variables.
df_dummies = pd.get_dummies(df, columns=['Education', 'Family'], drop_first=True)
df_dummies.head()
Out[46]:
Age Experience Income ZIPCode CCAvg Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard Education_2 Education_3 Family_2 Family_3 Family_4
0 25 1 49 91107 1.6 0 0 1 0 0 0 0 0 0 0 1
1 45 19 34 90089 1.5 0 0 1 0 0 0 0 0 0 1 0
2 39 15 11 94720 1.0 0 0 0 0 0 0 0 0 0 0 0
3 35 9 100 94112 2.7 0 0 0 0 0 0 1 0 0 0 0
4 35 8 45 91330 1.0 0 0 0 0 0 1 1 0 0 0 1
In [47]:
# Rechecking for null values.
df_dummies.isnull().sum()
Out[47]:
Age                   0
Experience            0
Income                0
ZIPCode               0
CCAvg                 0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
Education_2           0
Education_3           0
Family_2              0
Family_3              0
Family_4              0
dtype: int64
In [48]:
#Checking data types
df_dummies.dtypes
Out[48]:
Age                     int64
Experience              int64
Income                  int64
ZIPCode                 int64
CCAvg                 float64
Mortgage                int64
Personal_Loan           int64
Securities_Account      int64
CD_Account              int64
Online                  int64
CreditCard              int64
Education_2             uint8
Education_3             uint8
Family_2                uint8
Family_3                uint8
Family_4                uint8
dtype: object

We will most likely not apply the dummy variables to the data, but we are utilizing them as a back-up if necessary for further data evaluation or model influences.

Model Building

Model Evaluation Criterion

More emphasis should be placed on recall since our primary objective is to predict whether the customer will accept a personal loan or not. The bank aims to encourage more customers to accept personal loans, which means reducing the number of False Negatives to ensure that the bank doesn't miss out on genuine customers who wish to borrow. Therefore, our main focus should be on improving recall.

Model Pre-processing

In [49]:
def model_performance_classification_with_confusion_matrix(model, predictors, target):

    # Predicting using the independent variables
    pred = model.predict(predictors)

    # Computing metrics
    accuracy = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # Creating a DataFrame of metrics
    df_performance = pd.DataFrame({"Accuracy": [accuracy], "Recall": [recall], "Precision": [precision], "F1": [f1]})

    # Calculate the confusion matrix
    cm = confusion_matrix(target, pred)
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]  # Normalize for percentages
    labels = np.asarray([["{0:0.0f}\n{1:.2%}".format(value, percentage) for value, percentage in zip(cm.flatten(), cm_normalized.flatten())]]).reshape(2,2)

    # Plotting the confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=labels, fmt='', cmap='Blues', cbar=False, xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.title('Confusion Matrix with Metrics')

    # Display performance metrics on the plot
    metrics_text = f'Accuracy: {accuracy:.1%}\nPrecision: {precision:.1%}\nRecall: {recall:.1%}\nF1 Score: {f1:.1%}'
    plt.text(2.5, 0.5, metrics_text, va='center', ha='center', bbox=dict(facecolor='white', edgecolor='black', boxstyle='round,pad=1'))
    return df_performance
In [50]:
## Define X and Y variables
X=df
y=df['Personal_Loan']
X = df.drop('Personal_Loan',axis=1)
X
Out[50]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Securities_Account CD_Account Online CreditCard
0 25 1 49 91107 4 1.6 1 0 1 0 0 0
1 45 19 34 90089 3 1.5 1 0 1 0 0 0
2 39 15 11 94720 1 1.0 1 0 0 0 0 0
3 35 9 100 94112 1 2.7 2 0 0 0 0 0
4 35 8 45 91330 4 1.0 2 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ...
4995 29 3 40 92697 1 1.9 3 0 0 0 1 0
4996 30 4 15 92037 4 0.4 1 85 0 0 1 0
4997 63 39 24 93023 2 0.3 3 0 0 0 0 0
4998 65 40 49 90034 3 0.5 2 0 0 0 1 0
4999 28 4 83 92612 3 0.8 1 0 0 0 1 1

5000 rows × 12 columns

Modeling (Includes Performance Testing)

In [51]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
In [52]:
# Initialize the Decision Tree Classifier with a random state for reproducibility
clf = DecisionTreeClassifier(random_state=1)

# Train the model on the training data
clf.fit(X_train, y_train)

# Predict the target values for the testing set
y_test = clf.predict(X_test)

# Correcting the function call with the right variable names
performance_df = model_performance_classification_with_confusion_matrix(clf, X_test, y_test)
print(performance_df)
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0

As expected, running a standard performance test results in overfit data which effectively counts the true yes's and true no's

In [53]:
column_names = list(X.columns)
feature_names = column_names
plt.figure(figsize=(15, 20))

out = tree.plot_tree(clf, feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True,)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [54]:
importances = clf.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
In [55]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

#scaling data as LR is effected by outliers
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
Out[55]:
LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [56]:
# Evaluating performance on the training set
performance_df_train_LR = model_performance_classification_with_confusion_matrix(model, X_train_scaled, y_train)
print("Training Set Performance:\n", performance_df_train_LR)

# Evaluating performance on the test set - Use the scaled test data here
performance_df_test_LR = model_performance_classification_with_confusion_matrix(model, X_test_scaled, y_test)  # Corrected to use X_test_scaled
print("Test Set Performance:\n", performance_df_test_LR)
Training Set Performance:
    Accuracy    Recall  Precision        F1
0  0.953429  0.640483   0.828125  0.722317
Test Set Performance:
    Accuracy    Recall  Precision        F1
0  0.948667  0.586207   0.833333  0.688259

With LR, although accuracy scores are quite high, recall is low. Model is an improvement from "un-optimized", but not ideal.

In [57]:
# Grid Search for the best model parameters
param_grid_1 = {
    'max_depth': np.arange(3,21,3).tolist() + [None],  # Adding None to allow for unlimited depth
    'min_samples_split': np.arange(2, 21, 2).tolist(),  # Ranges from 2 to 20, stepping by 2
    'min_samples_leaf': np.arange(1, 11, 1).tolist(),  # Ranges from 1 to 10, stepping by 1
    'max_leaf_nodes': np.arange(10, 101, 10).tolist() + [None]  # Ranges from 10 to 100, stepping by 10, and None
}

param_grid_2 = {
    "max_depth": np.arange(3,21,3).tolist() + [None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.000001, 0.00001, 0.0001],
}

clf_pre_pruned = DecisionTreeClassifier(random_state=1)
grid_search_1 = GridSearchCV(clf_pre_pruned, param_grid_1, cv=5, scoring='recall')  # Focus on recall to catch as many positives as possible
grid_search_1 = grid_search_1.fit(X_train, y_train)

grid_search_2 = GridSearchCV(clf_pre_pruned, param_grid_2, cv=5, scoring='recall')  # Focus on recall to catch as many positives as possible
grid_search_2 = grid_search_2.fit(X_train, y_train)

print("Best parameters found by grid search:", grid_search_1.best_params_)
print("Best parameters found by grid search:", grid_search_2.best_params_)

estimator_1 = grid_search_1.best_estimator_
estimator_2 = grid_search_2.best_estimator_

# Fit the best algorithm to the data.
estimator_1 = estimator_1.fit(X_train, y_train)
estimator_2 = estimator_2.fit(X_train, y_train)

print(estimator_1)
print(estimator_2)
Best parameters found by grid search: {'max_depth': 6, 'max_leaf_nodes': 20, 'min_samples_leaf': 3, 'min_samples_split': 12}
Best parameters found by grid search: {'criterion': 'gini', 'max_depth': 9, 'min_impurity_decrease': 1e-06, 'splitter': 'best'}
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=20, min_samples_leaf=3,
                       min_samples_split=12, random_state=1)
DecisionTreeClassifier(max_depth=9, min_impurity_decrease=1e-06, random_state=1)

Altough pre-prune 2 (grid search parameters 2) has a relatively deep depth (9) compared to pre-prune 1 (grid search parameters 1; 6), it may still be the best model. Although, if the performance is close, the more simpler pre-pruned model may be ideal.

In [58]:
column_names = list(X.columns)
feature_names = column_names
plt.figure(figsize=(15, 20))

out = tree.plot_tree(estimator_1, feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True,)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [59]:
column_names = list(X.columns)
feature_names = column_names
plt.figure(figsize=(15, 20))

out = tree.plot_tree(estimator_2, feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True,)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [60]:
Optimized_pre_prune_1_train = model_performance_classification_with_confusion_matrix(estimator_1, X_train, y_train)*100
Optimized_pre_prune_2_train = model_performance_classification_with_confusion_matrix(estimator_2, X_train, y_train)*100

print("Train Performance of Estimator 1:\n", Optimized_pre_prune_1_train,"\n")
print("Train Performance of Estimator 2:\n", Optimized_pre_prune_2_train)
Train Performance of Estimator 1:
     Accuracy     Recall  Precision         F1
0  99.228571  93.957704  97.798742  95.839753 

Train Performance of Estimator 2:
     Accuracy     Recall  Precision         F1
0  99.914286  99.093656      100.0  99.544765

Both models appear to be quite good. However, pre-pruning 2 minimizes the True yes's predicted as No's better.

In [61]:
Optimized_pre_prune_1_test = model_performance_classification_with_confusion_matrix(estimator_1, X_test, y_test)*100
Optimized_pre_prune_2_test = model_performance_classification_with_confusion_matrix(estimator_2, X_test, y_test)*100

print("Test Performance of Estimator 1:\n", Optimized_pre_prune_1_test)
print("Test Performance of Estimator 2:\n", Optimized_pre_prune_2_test)
Test Performance of Estimator 1:
     Accuracy     Recall  Precision         F1
0  99.133333  93.793103  97.142857  95.438596
Test Performance of Estimator 2:
    Accuracy     Recall  Precision         F1
0      99.6  97.931034  97.931034  97.931034

The test data is very comparable to the training data in both pre-pruning grids, which apparent that these are excellent models!

In [62]:
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced")
    clf.fit(X_train, y_train)
    clfs.append(clf)


#developing recall testing on training data
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

#developing recall testing on test data
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)

#scores
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

#plot
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()

It appears the train and test data both have high recall with a low alpha (which is difficult to see from the visual and must be determined what the best model is based on the alpha).

In [63]:
best_model = np.argmax(recall_test)
best_model = clfs[best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.2953792759992316, class_weight='balanced',
                       random_state=1)

The ideal alpha appears to be ~0.29. We could also run a different class weight (or remove), but we will determine if it is necessary based on the results of the performance test of the 'best model'

In [64]:
Optimized_post_prune_train = model_performance_classification_with_confusion_matrix(clf, X_train, y_train)*100
Optimized_post_prune_test = model_performance_classification_with_confusion_matrix(clf, X_test, y_test)*100

print("Test Performance of Training Post Prune:\n", Optimized_post_prune_train)
print("Test Performance of Test Post Prune:\n", Optimized_post_prune_test)
Test Performance of Training Post Prune:
    Accuracy  Recall  Precision         F1
0  9.457143   100.0   9.457143  17.280084
Test Performance of Test Post Prune:
    Accuracy  Recall  Precision         F1
0  9.666667   100.0   9.666667  17.629179

Although recall is maximized. All results have been predicted as 'yes' and therefore the data is overfitting.

In [65]:
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    clf,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

Model Comparison and Final Model Selection

In [66]:
# training performance comparison

# Concatenate the DataFrames along columns (axis=1)
models_train_comp_df = pd.concat(
    [
        performance_df_train_LR.T,
        performance_df.T,
        Optimized_pre_prune_1_train.T,
        Optimized_pre_prune_2_train.T,
        Optimized_post_prune_train.T
    ],
    axis=1,
)

models_train_comp_df.columns = [
    "Logistic Regression Train",
    "Decision Tree Unadjusted",
    "Decision Tree (Pre-Pruning 1) Train",
    "Decision Tree (Pre-Pruning 2) Train",
    "Decision Tree (Post-Pruning) Train"
]

print("Train performance comparison:")
models_train_comp_df
Train performance comparison:
Out[66]:
Logistic Regression Train Decision Tree Unadjusted Decision Tree (Pre-Pruning 1) Train Decision Tree (Pre-Pruning 2) Train Decision Tree (Post-Pruning) Train
Accuracy 0.953429 1.0 99.228571 99.914286 9.457143
Recall 0.640483 1.0 93.957704 99.093656 100.000000
Precision 0.828125 1.0 97.798742 100.000000 9.457143
F1 0.722317 1.0 95.839753 99.544765 17.280084
In [67]:
# test performance comparison

# Concatenate the DataFrames along columns (axis=1)
models_test_comp_df = pd.concat(
    [
        performance_df_test_LR.T,
        performance_df.T,
        Optimized_pre_prune_1_test.T,
        Optimized_pre_prune_2_test.T,
        Optimized_post_prune_test.T
    ],
    axis=1,
)



models_test_comp_df.columns = [
    "Logistic Regression",
    "Decision Tree Unadjusted",
    "Decision Tree (Pre-Pruning 1) Test",
    "Decision Tree (Pre-Pruning 2) Test",
    "Decision Tree (Post-Pruning) Test"
]

print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[67]:
Logistic Regression Decision Tree Unadjusted Decision Tree (Pre-Pruning 1) Test Decision Tree (Pre-Pruning 2) Test Decision Tree (Post-Pruning) Test
Accuracy 0.948667 1.0 99.133333 99.600000 9.666667
Recall 0.586207 1.0 93.793103 97.931034 100.000000
Precision 0.833333 1.0 97.142857 97.931034 9.666667
F1 0.688259 1.0 95.438596 97.931034 17.629179

Based on the recall scores (primarily) in the test and training data model results. It has been established that Pre-Pruning 2 model is the best model. This model utilizes criterion, impurity and splitter hyperparameters, which led to the most optimal model.

Actionable Insights and Business Recommendations

  • What recommedations would you suggest to the bank?

Business Insights

The Bank should target customers with higher incomes : The analysis showed that customers with higher income levels are more likely to accept personal loans. Marketing efforts could be more focused on higher income individuals and families.

The Bank should target customers with higher CC-Avg: Similarly, customers with higher average credit card count (CC-Avg) demonstrated a higher likelihood of taking personal loans. These customers are likely more comfortable with debt management.

Consider Customer's Education Level: Based on the modeling data, customers with a higehr education level are more likely or are more influenced in potentially taking out a loan. Whether there is more comfort from education or education requires loans, this is a great area to target.

Utilize Customer's Banking Behaviors: Customers who already have a securities account / CD account with the bank show higher conversion rates to loans. These services indicate a customer's trust and relationship with the bank, which could be utilized as a promotional offering for personal loans

Consider Age and Experience: While age and experience alone didn't show a clear advantage for personal loan acceptance, combining these with other factors like income and education could help in segment the customer clusters better.

Recommendations

Consider Developing marketing campaigns that focus on the customers identified through the model. Further clustering analysis may help to segment these customers better, however, utilizing data on income and targeting these customers, education, CC-avg, or those in school may be promising. Utilize variables such as age and other factors (potentially credit history) to help establish more detailed guidelines.

  • At this point, it is hard to determine the exact cause of 'why' those who are likely capable are avoiding loans. However, one of the major factors is likely education on financial management, interest rates, benefits, etc. If information on advantages can be provided (such as a booth at the local school) there may be more investment in focus groups.

Begin collecting more data on other factors that effect loan acquisition, such as: credit history (if possible), interest rates compared to to loan acceptance, perhaps compounding interest differences compared to new accounts.

In addition, perhaps there are opportunities for refinancing of current debt or consolidating debt for new graduate or other segments.

Tailor personal loan offers based on the customer's financial profile. For example, offer competitive interest rates or flexible repayment terms for high-income customers or those with a good track record of credit card usage.

Continuously monitor the performance of the personal loan campaigns and the predictive models. Adapt strategies as the overall economy changes. Economic changes likely have a major impact on loan acquisition.